Abstract

This paper proposes the $\kappa$-generalized distribution as a model for describing the
distribution and dispersion of income within a population. Formulas for the shape,
moments and standard tools for inequality measurement, such as the Lorenz curve
and the Gini coefficient, are given. A method for parameter estimation is also
discussed. The model is shown to fit the data on personal income distribution in
Australia and the United States extremely well.
1 Introduction

In the analysis of income distributions, analysts have found it useful to have
distributional summaries based on estimates of specific parametric functional
forms, not only for their suitability in modelling some features of many
empirical income distributions, but also because of their role as equilibrium
distributions in economic processes [1].

* Corresponding author: Tel.: +39-071-22-07-103; fax: +39-071-22-07-102. 

Email addresses: fabio.clementi@univpm.it (F. Clementi), 
tiziana.dimatteo@anu.edu.au (T. Di Matteo), mauro.gallegati@univpm.it 
(M. Gallegati), giorgio.kaniadakis@polito.it (G. Kaniadakis). 



Preprint submitted to Physica A 



8 December 2008 



Vilfredo Pareto first proposed a model of income distribution in the form of a
probability density function in 1897 [2], providing a description of the density
for income values above some lower bound, $x_0 > 0$ [3]. If one focuses on
the distribution amongst those with income greater than $x_0$, there are simple
expressions for the moments which depend only on the Pareto parameters
$\alpha$ and $x_0$. Moreover, the expressions for most common inequality measures
depend only on $\alpha$, so that the (inverse of) $\alpha$ may also be considered as an
inequality measure.

However, the apparent attractions of the Pareto distribution evaporate somewhat
when one considers its implications for the distribution of income amongst
the population as a whole, i.e. including units with income less than $x_0$. For
example, Ref. [4] shows how the expression for the Gini coefficient depends
on assumptions about the size of the excluded population (i.e. the proportion of
the population with income below $x_0$) and its average income. In particular, $\alpha$
no longer has such a straightforward interpretation. For example, an increase
in $\alpha$ may be associated with a decrease in inequality according to the Gini
coefficient, but an increase according to the coefficient of variation.

Later empirical studies showed that the income distribution is right-skewed 
and has a fat right-hand tail, and that the Pareto distribution accurately 
models only high levels of income, but does a poor job in describing the lower 
end of the distribution. As research continued, new models were proposed 
that better describe the data, using both a combination of known statistical 
distributions [5] and parametric functional forms for the distribution of income 
as a whole, including two-parameter models such as the lognormal and gamma, 
three-parameter distributions such as the Singh-Maddala and Dagum I, and 
four-parameter distributions such as the generalized beta distributions of the 
first and second kind (see the comprehensive survey by Ref. [6]). 

Pareto fat tails have also been observed experimentally in physical statistical
systems, where they are located in the high-energy region in which the laws of
classical physics are replaced by the relativistic ones. After 2001, a physical
mechanism emerging in the context of special relativity was proposed by one of us
[7], predicting a deformation of the exponential function. According to this
mechanism, the classical exponential distribution transforms into a new
distribution which, at high energies, presents a Pareto fat tail. More precisely,
this mechanism deforms the ordinary exponential function $\exp(x)$ into the
generalized exponential function $\exp_\kappa(x)$ given by

$$\exp_\kappa(x) = \left(\sqrt{1+\kappa^2x^2}+\kappa x\right)^{1/\kappa}. \tag{1}$$

The above deformation is generated by the fact that information propagates at a
finite speed, and the deformation parameter $\kappa$ is proportional to the
reciprocal of this speed. The $\kappa$-generalized exponential has the
important properties

$$\exp_\kappa(x) \underset{x\to\pm\infty}{\sim} |2\kappa x|^{\pm 1/|\kappa|}, \tag{2a}$$

$$\exp_\kappa(x) \underset{x\to 0}{\sim} \exp(x). \tag{2b}$$

It is remarkable that for classical systems, where information propagates
instantaneously, one has $\kappa = 0$, so that the ordinary exponential emerges
naturally after noting that $\exp_0(x) = \exp(x)$. Moreover, in the low-energy
region $x \to 0$, according to Eq. (2b), the exponential distribution emerges again,
because the system behaves classically. On the contrary, in systems where
information propagates at a finite speed (these systems are intrinsically
relativistic) one has $\kappa \neq 0$, so that the exponential tails become fat
according to Eq. (2a) and the Pareto law emerges.
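
As a quick numerical illustration of Eq. (1) and of the limits (2a) and (2b), the following Python sketch evaluates the deformed exponential. The function name `exp_kappa` and the test values are illustrative choices, not taken from the paper; the `asinh` form is an algebraically equivalent rewriting of Eq. (1), adopted here only for numerical stability.

```python
import math

def exp_kappa(x: float, kappa: float) -> float:
    """kappa-exponential of Eq. (1); reduces to exp(x) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(x)
    # sqrt(1 + y^2) + y == exp(asinh(y)); this avoids cancellation when y << 0
    return math.exp(math.asinh(kappa * x) / kappa)

kappa = 0.5            # illustrative deformation parameter
small, large = 1e-4, 1e4

# Eq. (2b): close to the ordinary exponential near the origin
ratio_small = exp_kappa(small, kappa) / math.exp(small)
# Eq. (2a): power-law behaviour |2 kappa x|^(+-1/kappa) for x -> +-infinity
ratio_large = exp_kappa(large, kappa) / (2 * kappa * large) ** (1 / kappa)
ratio_neg = exp_kappa(-large, kappa) / (2 * kappa * large) ** (-1 / kappa)

print(ratio_small, ratio_large, ratio_neg)  # all three ratios should be close to 1
```

The three printed ratios compare the function with its claimed asymptotic forms, so values near one are what Eqs. (2a) and (2b) lead one to expect.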

The generalized exponential represents a very useful and powerful tool for
formulating a new statistical theory capable of treating systems described by
distribution functions exhibiting power-law tails and admitting a stable entropy
[8,9]. Furthermore, non-linear evolution models already known in statistical
physics [10] can be easily adapted or generalized within the new theory.

After 2001, the function $\exp_\kappa(x)$ was adopted successfully in the analysis
of various physical and non-physical systems. In Ref. [11] we used the
function $\exp_\kappa(x)$ to model the personal income distribution by defining the
Complementary Cumulative Distribution Function (CCDF) through

$$P_>(x) = \exp_\kappa(-\beta x^\alpha), \quad x \in \mathbb{R}^+, \quad \alpha, \beta > 0, \quad \kappa \in [0,1), \tag{3}$$

where the income variable $x$ is defined as $x = z/\langle z\rangle$, $z$ being the absolute
personal income and $\langle z\rangle$ its mean value. The corresponding Probability Density
Function (PDF) reads

$$p(x) = \frac{\alpha\beta x^{\alpha-1}\exp_\kappa(-\beta x^\alpha)}{\sqrt{1+\beta^2\kappa^2x^{2\alpha}}}. \tag{4}$$

It follows immediately that in this model the CCDF behaves at low incomes as a
stretched exponential, $P_>(x) \sim \exp(-\beta x^\alpha)$, while at high incomes it
follows the Pareto law $P_>(x) \sim (2\beta\kappa)^{-1/\kappa}x^{-\alpha/\kappa}$. Similarly, for
$x \to 0^+$ the PDF behaves as a Weibull distribution,
$p(x) \sim \alpha\beta x^{\alpha-1}\exp(-\beta x^\alpha)$, while for $x \to +\infty$ it reduces
to the Pareto law $p(x) \sim \frac{\alpha}{\kappa}(2\beta\kappa)^{-1/\kappa}x^{-(\alpha/\kappa+1)}$.
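
The two regimes of Eqs. (3) and (4) can be checked numerically. The sketch below is a minimal illustration; the function names (`ccdf`, `pdf`) and the parameter values are hypothetical choices made for the example and are not estimates from the paper.

```python
import math

def exp_kappa(x, kappa):
    """kappa-exponential, written via asinh for numerical stability."""
    return math.exp(math.asinh(kappa * x) / kappa) if kappa else math.exp(x)

def ccdf(x, alpha, beta, kappa):
    """Complementary CDF P_>(x) of Eq. (3)."""
    return exp_kappa(-beta * x ** alpha, kappa)

def pdf(x, alpha, beta, kappa):
    """Density p(x) of Eq. (4)."""
    y = beta * x ** alpha
    return (alpha * beta * x ** (alpha - 1) * exp_kappa(-y, kappa)
            / math.sqrt(1 + (kappa * y) ** 2))

alpha, beta, kappa = 2.0, 1.0, 0.5   # illustrative values only
x_low, x_high = 1e-2, 1e3

# Low incomes: stretched-exponential (Weibull-type) behaviour
low_ratio = ccdf(x_low, alpha, beta, kappa) / math.exp(-beta * x_low ** alpha)

# High incomes: Pareto power laws for the CCDF and the PDF
pareto_ccdf = (2 * beta * kappa) ** (-1 / kappa) * x_high ** (-alpha / kappa)
pareto_pdf = (alpha / kappa) * (2 * beta * kappa) ** (-1 / kappa) \
    * x_high ** (-(alpha / kappa + 1))
tail_ratio = ccdf(x_high, alpha, beta, kappa) / pareto_ccdf
pdf_tail_ratio = pdf(x_high, alpha, beta, kappa) / pareto_pdf

print(low_ratio, tail_ratio, pdf_tail_ratio)  # each ratio should be close to 1
```

With these parameters the Pareto exponent of the CCDF is $\alpha/\kappa = 4$, so the tail comparison is made far enough out for the asymptotic form to dominate.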

Starting from the definitions in Eqs. (3) and (4), in this work we derive the
basic statistical properties of the proposed distribution along with common tools
that are required for income distribution analysis; these include, among others,
the ubiquitous Lorenz curve and the associated Gini measure of inequality.
The basic proposition of this paper is that the $\kappa$-generalized distribution
provides a very good description of the size distribution of income, ranging from
the low region to the middle region, and up to the power-law tail, and that the
inequality analysis expressed in terms of its parameters proves very powerful.

The content of the paper is organized as follows: Sec. 2 includes a discussion
of the $\kappa$-generalized distribution and reports formulas which are useful in the
estimation and analysis of empirical data. Sec. 3 illustrates applications of the
results to Australian and US household survey data. Sec. 4 concludes.



2 The $\kappa$-generalized distribution



2.1 Basic properties



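Using the complementary relation $P_<(x) = 1 - P_>(x)$, we see that the quantile
function is available in closed form

$$x = P_<^{-1}(u) = \beta^{-1/\alpha}\left[\ln_\kappa\left(\frac{1}{1-u}\right)\right]^{1/\alpha}, \quad 0 < u < 1, \tag{5}$$

a property that facilitates the derivation of Lorenz-ordering results (see Sec.
2.3); here $\ln_\kappa(u) = (u^{\kappa} - u^{-\kappa})/(2\kappa)$ denotes the inverse function of
$\exp_\kappa(u)$. From Eq. (5) we easily determine that the median of the distribution is

$$x_{\mathrm{med}} = \beta^{-1/\alpha}\left[\ln_\kappa(2)\right]^{1/\alpha}.$$

The mode is at

$$x_{\mathrm{mode}} = \beta^{-1/\alpha}\left[\frac{\alpha^2+2\kappa^2(\alpha-1)}{2\kappa^2(\alpha^2-\kappa^2)}\right]^{\frac{1}{2\alpha}}\left\{\sqrt{1+\frac{4\kappa^2(\alpha^2-\kappa^2)(\alpha-1)^2}{\left[\alpha^2+2\kappa^2(\alpha-1)\right]^2}}-1\right\}^{\frac{1}{2\alpha}} \tag{6}$$

if $\alpha > 1$; otherwise, the distribution is zero-modal with a pole at the origin.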
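
The closed forms for the quantile function, the median and the mode lend themselves to a direct implementation. The following Python sketch is one possible rendering, assuming $0 < \kappa < 1$ for the mode; all function names and the parameter values are illustrative and do not come from the paper.

```python
import math

def exp_kappa(x, kappa):
    return math.exp(math.asinh(kappa * x) / kappa) if kappa else math.exp(x)

def ln_kappa(u, kappa):
    """Inverse of exp_kappa: (u^kappa - u^-kappa) / (2 kappa)."""
    return (u ** kappa - u ** (-kappa)) / (2 * kappa) if kappa else math.log(u)

def pdf(x, alpha, beta, kappa):
    y = beta * x ** alpha
    return (alpha * beta * x ** (alpha - 1) * exp_kappa(-y, kappa)
            / math.sqrt(1 + (kappa * y) ** 2))

def quantile(u, alpha, beta, kappa):
    """Eq. (5): beta^(-1/alpha) * [ln_kappa(1 / (1 - u))]^(1/alpha), 0 < u < 1."""
    return (ln_kappa(1 / (1 - u), kappa) / beta) ** (1 / alpha)

def mode(alpha, beta, kappa):
    """Eq. (6), assuming 0 < kappa < 1; returns 0.0 when alpha <= 1."""
    if alpha <= 1:
        return 0.0
    b = alpha ** 2 + 2 * kappa ** 2 * (alpha - 1)
    c = 2 * kappa ** 2 * (alpha ** 2 - kappa ** 2)
    root = math.sqrt(
        1 + 4 * kappa ** 2 * (alpha ** 2 - kappa ** 2) * (alpha - 1) ** 2 / b ** 2
    ) - 1
    return beta ** (-1 / alpha) * (b / c * root) ** (1 / (2 * alpha))

alpha, beta, kappa = 2.0, 1.0, 0.5          # illustrative values only
x_med = quantile(0.5, alpha, beta, kappa)   # median = quantile at u = 1/2
x_mode = mode(alpha, beta, kappa)
print(x_med, x_mode)
```

A simple sanity check is that the CDF evaluated at `x_med` returns one half, and that the density at `x_mode` is not exceeded at nearby points.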



2.2 Moments and related parameters 



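The $r$th-order moment about the origin of the $\kappa$-generalized distribution equals

$$\mu'_r = \int_0^\infty x^r p(x)\,dx = \frac{(2\beta\kappa)^{-r/\alpha}}{1+\frac{r}{\alpha}\kappa}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{r}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{r}{2\alpha}\right)}\,\Gamma\left(1+\frac{r}{\alpha}\right), \tag{7}$$

where $\Gamma(x)$ is the Gamma function $\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt$. Specifically, $\mu'_1 = m$
is the mean of the distribution.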
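
Eq. (7) can be cross-checked against a direct numerical integration. The sketch below is only an illustration: it assumes $0 < \kappa < 1$ and $r < \alpha/\kappa$ (so that the Gamma function arguments are positive), uses the identity $\mu'_r = \int_0^1 [P_<^{-1}(u)]^r\,du$ with a plain midpoint rule, and relies on hypothetical function names and parameter values.

```python
import math

def raw_moment(r, alpha, beta, kappa):
    """Closed form of Eq. (7); needs 0 < kappa < 1 and -alpha < r < alpha / kappa."""
    a = r / alpha
    lower = 1 / (2 * kappa) - a / 2
    if lower <= 0 or 1 + a <= 0:
        raise ValueError("moment of this order does not exist")
    # Ratio of Gamma functions computed in log space to avoid overflow
    log_ratio = math.lgamma(lower) - math.lgamma(1 / (2 * kappa) + a / 2)
    return ((2 * beta * kappa) ** (-a) * math.gamma(1 + a)
            * math.exp(log_ratio) / (1 + a * kappa))

def raw_moment_numeric(r, alpha, beta, kappa, n=200_000):
    """Midpoint-rule estimate of the integral of Q(u)^r on (0, 1), Q as in Eq. (5)."""
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        v = 1 / (1 - u)
        q = ((v ** kappa - v ** (-kappa)) / (2 * kappa * beta)) ** (1 / alpha)
        total += q ** r
    return total / n

alpha, beta, kappa = 2.5, 1.0, 0.3   # illustrative values only
for r in (1, 2):
    print(r, raw_moment(r, alpha, beta, kappa), raw_moment_numeric(r, alpha, beta, kappa))
```

The midpoint rule is a deliberately simple choice; it converges slowly near $u = 1$ when the tail is heavy, so the agreement should be read as approximate.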
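A formula for the variance is obtained by converting Eq. (7) to the moment
about the mean using the general equation
$\mu_r = \sum_{j=0}^{r}\binom{r}{j}(-1)^{r-j}\mu'_j\,m^{r-j}$;
hence, for $r = 2$ we have

$$\sigma^2 = (2\beta\kappa)^{-2/\alpha}\left\{\frac{\Gamma\left(1+\frac{2}{\alpha}\right)}{1+2\frac{\kappa}{\alpha}}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{\alpha}\right)}-\left[\frac{\Gamma\left(1+\frac{1}{\alpha}\right)}{1+\frac{\kappa}{\alpha}}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}\right]^2\right\}. \tag{8}$$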

In this way it is also possible to define the standardized moments of the
distribution, which are in turn used to define skewness and excess kurtosis,
respectively given by

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mu'_3 - 3m\sigma^2 - m^3}{\sigma^3} \tag{9}$$

and

$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = \frac{\mu'_4 - 3\sigma^4 - 4\gamma_1\sigma^3 m - 6\sigma^2m^2 - m^4}{\sigma^4}. \tag{10}$$
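
The variance, skewness and excess kurtosis follow mechanically from the raw moments of Eq. (7). The Python sketch below is a minimal illustration under the assumption $\alpha/\kappa > 4$, which is needed for the fourth moment to exist; the function names are hypothetical.

```python
import math

def raw_moment(r, alpha, beta, kappa):
    """Closed form of Eq. (7); needs 0 < kappa < 1 and r < alpha / kappa."""
    a = r / alpha
    log_ratio = (math.lgamma(1 / (2 * kappa) - a / 2)
                 - math.lgamma(1 / (2 * kappa) + a / 2))
    return ((2 * beta * kappa) ** (-a) * math.gamma(1 + a)
            * math.exp(log_ratio) / (1 + a * kappa))

def shape_summary(alpha, beta, kappa):
    """Return (mean, variance, skewness, excess kurtosis) via Eqs. (8)-(10)."""
    if alpha / kappa <= 4:
        raise ValueError("fourth moment does not exist for alpha / kappa <= 4")
    m, m2, m3, m4 = (raw_moment(r, alpha, beta, kappa) for r in (1, 2, 3, 4))
    var = m2 - m ** 2                      # sigma^2 = mu'_2 - m^2, i.e. Eq. (8)
    sigma = math.sqrt(var)
    skew = (m3 - 3 * m * var - m ** 3) / sigma ** 3                      # Eq. (9)
    kurt = (m4 - 3 * var ** 2 - 4 * skew * sigma ** 3 * m
            - 6 * var * m ** 2 - m ** 4) / var ** 2                      # Eq. (10)
    return m, var, skew, kurt

# Nearly classical case: alpha = 1 and a tiny kappa approach the exponential law,
# whose mean, variance, skewness and excess kurtosis are 1, 1, 2 and 6.
print(shape_summary(1.0, 1.0, 1e-3))
```

Taking $\alpha = 1$ with a very small $\kappa$ is a convenient check because, by Eq. (2b), the distribution then approaches the ordinary exponential.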



2.3 Lorenz curves and inequality measures 



For a discussion of income inequality, the standard practice adopts the concept 
of concentration of incomes as defined by Lorenz [12]. The so-called Lorenz 
curve measures the cumulative fraction of population with incomes below x 
along the horizontal axis, and the fraction of the total income this population 
accounts for along the vertical axis. The points plotted for the various values 
of x trace out a curve below the 45° line sloping upwards to the right from 
the origin. 



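In statistical terms, for any general distribution supported on the nonnegative
half-line with a finite and positive first moment, the Lorenz curve can
be written in the form $L(u) = \frac{1}{m}\int_0^u P_<^{-1}(t)\,dt$, $u \in [0,1]$, where
$m = \int_0^1 P_<^{-1}(u)\,du$ is the quantile formula for the mean, and $P_<^{-1}(u)$ is the
quantile function given by Eq. (5).¹ Thus, we have the Lorenz curve for the
$\kappa$-generalized distribution as follows

$$L_\kappa(u) = 1 - \frac{1+\frac{\kappa}{\alpha}}{2\kappa}\,\frac{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)\Gamma\left(1+\frac{1}{\alpha}\right)}\,B_X\left(\frac{1}{2\kappa}-\frac{1}{2\alpha},\,1+\frac{1}{\alpha}\right), \tag{11}$$

where $B_X(s,r)$ is the incomplete Beta function given by
$B_X(s,r) = \int_0^X t^{s-1}(1-t)^{r-1}\,dt$ with $X = (1-u)^{2\kappa}$.

1 See e.g. Ref. [13].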
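
The quantile representation of the Lorenz curve and the closed form in terms of the incomplete Beta function can be compared numerically. The sketch below is a rough illustration only: both integrals are approximated with a plain midpoint rule (a simplification chosen to avoid external libraries), and the function names and parameter values are hypothetical.

```python
import math

def midpoint(f, a, b, n=100_000):
    """Plain midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def lorenz_numeric(u, alpha, beta, kappa):
    """L(u) = (1/m) * integral of the quantile function of Eq. (5) over [0, u]."""
    def q(t):
        v = 1 / (1 - t)
        return ((v ** kappa - v ** (-kappa)) / (2 * kappa * beta)) ** (1 / alpha)
    return midpoint(q, 0.0, u) / midpoint(q, 0.0, 1.0)

def lorenz_closed(u, alpha, kappa):
    """Lorenz curve via the incomplete Beta function B_X(s, r), X = (1 - u)^(2 kappa)."""
    s = 1 / (2 * kappa) - 1 / (2 * alpha)
    r = 1 + 1 / alpha
    x_upper = (1 - u) ** (2 * kappa)
    b_x = midpoint(lambda t: t ** (s - 1) * (1 - t) ** (r - 1), 0.0, x_upper)
    prefactor = ((1 + kappa / alpha) / (2 * kappa)
                 * math.exp(math.lgamma(1 / (2 * kappa) + 1 / (2 * alpha))
                            - math.lgamma(s) - math.lgamma(r)))
    return 1 - prefactor * b_x

alpha, beta, kappa = 2.5, 1.0, 0.3   # illustrative values only
u = 0.5
print(lorenz_numeric(u, alpha, beta, kappa), lorenz_closed(u, alpha, kappa))
```

Note that the scale parameter $\beta$ drops out of the Lorenz curve, which is why `lorenz_closed` does not take it as an argument.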
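The Gini measure of income inequality [14] can be derived using the
representation $G = 1 - \frac{1}{m}\int_0^\infty [1 - P_<(x)]^2\,dx$ given by Ref. [15]; it follows that the
Gini coefficient for the $\kappa$-generalized distribution is

$$G_\kappa = 1 - \frac{2\alpha+2\kappa}{2\alpha+\kappa}\,\frac{\Gamma\left(\frac{1}{\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{\kappa}+\frac{1}{2\alpha}\right)}\,\frac{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}. \tag{12}$$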



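Furthermore, relating the standard deviation to the mean yields the following
expression for the coefficient of variation

$$CV_\kappa = \frac{\sigma}{m} = \sqrt{\frac{\Gamma\left(1+\frac{2}{\alpha}\right)}{1+2\frac{\kappa}{\alpha}}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{\alpha}\right)}\left[\frac{1+\frac{\kappa}{\alpha}}{\Gamma\left(1+\frac{1}{\alpha}\right)}\,\frac{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}\right]^2-1}. \tag{13}$$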
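
Both inequality measures depend only on $\alpha$ and $\kappa$ and are straightforward to evaluate. The following Python sketch is an illustration under the assumptions $\kappa < \alpha$ (finite mean, needed for the Gini coefficient) and $\kappa < \alpha/2$ (finite variance, needed for the coefficient of variation); the function names are hypothetical.

```python
import math

def gamma_ratio(a, b):
    """Gamma(a) / Gamma(b) for positive a and b, computed in log space."""
    return math.exp(math.lgamma(a) - math.lgamma(b))

def gini(alpha, kappa):
    """Gini coefficient of Eq. (12); assumes 0 < kappa < min(1, alpha)."""
    return 1 - ((2 * alpha + 2 * kappa) / (2 * alpha + kappa)
                * gamma_ratio(1 / kappa - 1 / (2 * alpha), 1 / kappa + 1 / (2 * alpha))
                * gamma_ratio(1 / (2 * kappa) + 1 / (2 * alpha),
                              1 / (2 * kappa) - 1 / (2 * alpha)))

def coeff_variation(alpha, kappa):
    """Coefficient of variation of Eq. (13); assumes 0 < kappa < min(1, alpha / 2)."""
    second = (math.gamma(1 + 2 / alpha) / (1 + 2 * kappa / alpha)
              * gamma_ratio(1 / (2 * kappa) - 1 / alpha, 1 / (2 * kappa) + 1 / alpha))
    first = (math.gamma(1 + 1 / alpha) / (1 + kappa / alpha)
             * gamma_ratio(1 / (2 * kappa) - 1 / (2 * alpha),
                           1 / (2 * kappa) + 1 / (2 * alpha)))
    return math.sqrt(second / first ** 2 - 1)

# Nearly classical check: alpha = 1 with a tiny kappa approaches the exponential
# distribution, for which the Gini coefficient is 1/2 and the CV is 1.
print(gini(1.0, 1e-3), coeff_variation(1.0, 1e-3))
print(gini(2.5, 0.3), coeff_variation(2.5, 0.3))   # illustrative values only
```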



2.4 Estimation



Parameter estimation for the $\kappa$-generalized distribution can be performed
using the Maximum Likelihood (ML) approach. Assuming that all observations
$\mathbf{x} = \{x_1, \ldots, x_n\}$ are independent, the likelihood function is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} p(x_i) = (\alpha\beta)^n \prod_{i=1}^{n} \frac{x_i^{\alpha-1}\exp_\kappa(-\beta x_i^\alpha)}{\sqrt{1+\beta^2\kappa^2x_i^{2\alpha}}}, \tag{14}$$

where $\theta = \{\alpha, \beta, \kappa\}$ is the parameter vector. This leads to the problem of
solving the equations obtained by setting to zero the partial derivatives of the
log-likelihood function $l(\theta; \mathbf{x}) = \log L(\theta; \mathbf{x})$ with respect to $\kappa$, $\alpha$ and $\beta$.
However, explicit expressions for the ML estimators of the three parameters are
difficult to obtain, which makes a direct analytical solution intractable, and one
needs to use numerical optimization methods.
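
Since the likelihood has to be maximized numerically, the quantity one actually codes is the log-likelihood. The sketch below is a minimal Python illustration of the logarithm of Eq. (14); the function names and the toy sample are hypothetical, and the `asinh` form of $\log\exp_\kappa$ is used only for numerical stability.

```python
import math

def log_pdf(x, alpha, beta, kappa):
    """Logarithm of the density p(x) in Eq. (4), for x > 0."""
    y = beta * x ** alpha
    # log exp_kappa(-y) = -asinh(kappa * y) / kappa, with the limit -y at kappa = 0
    log_expk = -math.asinh(kappa * y) / kappa if kappa > 0 else -y
    return (math.log(alpha * beta) + (alpha - 1) * math.log(x)
            + log_expk - 0.5 * math.log1p((kappa * y) ** 2))

def log_likelihood(data, alpha, beta, kappa):
    """l(theta; x) = log L(theta; x) for independent observations."""
    return sum(log_pdf(x, alpha, beta, kappa) for x in data)

sample = [0.5, 1.0, 2.0]                       # toy data, not from the paper
print(log_likelihood(sample, alpha=2.0, beta=1.0, kappa=0.5))
```

Working with the sum of log-densities rather than the product in Eq. (14) avoids numerical underflow for large samples.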

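Taking into account the meaning of the variable $x$, the mean value equals
unity, i.e. $m = \int_0^\infty x\,p(x)\,dx = 1$. The latter relationship allows
the parameter $\beta$ to be expressed as a function of the parameters $\kappa$ and $\alpha$, obtaining

$$\beta = \frac{1}{2\kappa}\left[\frac{\Gamma\left(\frac{1}{\alpha}\right)}{\kappa+\alpha}\,\frac{\Gamma\left(\frac{1}{2\kappa}-\frac{1}{2\alpha}\right)}{\Gamma\left(\frac{1}{2\kappa}+\frac{1}{2\alpha}\right)}\right]^{\alpha}. \tag{15}$$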
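
Eq. (15) can be verified by plugging the resulting $\beta$ back into the mean given by Eq. (7) with $r = 1$. The Python sketch below does exactly that; the function names and parameter values are illustrative, and $0 < \kappa < \min(1, \alpha)$ is assumed so that the mean exists.

```python
import math

def gamma_ratio_mean(alpha, kappa):
    """Gamma(1/(2k) - 1/(2a)) / Gamma(1/(2k) + 1/(2a)), computed in log space."""
    return math.exp(math.lgamma(1 / (2 * kappa) - 1 / (2 * alpha))
                    - math.lgamma(1 / (2 * kappa) + 1 / (2 * alpha)))

def beta_unit_mean(alpha, kappa):
    """Eq. (15): the value of beta for which the distribution has mean one."""
    g = gamma_ratio_mean(alpha, kappa)
    return (math.gamma(1 / alpha) / (kappa + alpha) * g) ** alpha / (2 * kappa)

def mean(alpha, beta, kappa):
    """First raw moment, Eq. (7) with r = 1."""
    return ((2 * beta * kappa) ** (-1 / alpha) * math.gamma(1 + 1 / alpha)
            * gamma_ratio_mean(alpha, kappa) / (1 + kappa / alpha))

alpha, kappa = 2.5, 0.3                       # illustrative values only
beta = beta_unit_mean(alpha, kappa)
print(beta, mean(alpha, beta, kappa))         # the mean should be very close to 1
```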



In this way, the problem of determining the values of the free parameters
$\{\kappa, \alpha, \beta\}$ of the theory from the empirical data reduces to a two-parameter
$\{\kappa, \alpha\}$ fitting problem. Therefore, to find the parameter values that give the
most desirable fit, one can use the Constrained Maximum Likelihood (CML)
estimation method [16], which solves the general maximum log-likelihood
problem of the form $l(\theta; \mathbf{x}) = \sum_{i=1}^{n} \log p(x_i; \theta)^{w_i}$, where $n$ is the number
of observations, $w_i$ the weight assigned to each observation, and $p(x_i; \theta)$ the
probability of $x_i$ given $\theta$, subject to the non-linear equality constraint given by
Eq. (15) and bounds $\alpha, \beta > 0$ and $\kappa \in [0, 1)$. The CML procedure finds values
for the parameters in $\theta$ such that $l(\theta; \mathbf{x})$ is maximized using the sequential
quadratic programming method [17] as implemented, e.g., in MATLAB® 7.
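
To make the constrained fitting problem concrete, the following Python sketch maximizes the weighted log-likelihood over $\{\kappa, \alpha\}$ with $\beta$ tied to them through Eq. (15). It is a toy illustration only: a coarse grid search is swapped in for the sequential quadratic programming routine (to keep the example free of external dependencies), the data are synthetic draws obtained by inverse-transform sampling from Eq. (5), all weights are set to one, and every function name and numerical setting is a hypothetical choice.

```python
import math
import random

def beta_unit_mean(alpha, kappa):
    """Eq. (15): beta as a function of alpha and kappa under the unit-mean constraint."""
    g = math.exp(math.lgamma(1 / (2 * kappa) - 1 / (2 * alpha))
                 - math.lgamma(1 / (2 * kappa) + 1 / (2 * alpha)))
    return (math.gamma(1 / alpha) / (kappa + alpha) * g) ** alpha / (2 * kappa)

def log_pdf(x, alpha, beta, kappa):
    y = beta * x ** alpha
    return (math.log(alpha * beta) + (alpha - 1) * math.log(x)
            - math.asinh(kappa * y) / kappa - 0.5 * math.log1p((kappa * y) ** 2))

def weighted_loglik(data, weights, alpha, kappa):
    """Sum of w_i * log p(x_i; theta), with beta eliminated through Eq. (15)."""
    beta = beta_unit_mean(alpha, kappa)
    return sum(w * log_pdf(x, alpha, beta, kappa) for x, w in zip(data, weights))

def draw_sample(n, alpha, kappa, rng):
    """Inverse-transform sampling with the quantile function of Eq. (5)."""
    beta = beta_unit_mean(alpha, kappa)
    out = []
    for _ in range(n):
        v = 1 / (1 - rng.random())
        out.append(((v ** kappa - v ** (-kappa)) / (2 * kappa * beta)) ** (1 / alpha))
    return [x for x in out if x > 0]           # drop zero draws, as for real data

rng = random.Random(0)
data = draw_sample(500, alpha=2.0, kappa=0.5, rng=rng)
mean_x = sum(data) / len(data)
data = [x / mean_x for x in data]              # normalize to the empirical average
weights = [1.0] * len(data)                    # equal weights (assumption)

kappas = [i / 20 for i in range(1, 20)]        # 0.05, ..., 0.95
alphas = [1 + j / 10 for j in range(21)]       # 1.0, ..., 3.0
best_ll, best_kappa, best_alpha = max(
    (weighted_loglik(data, weights, a, k), k, a) for k in kappas for a in alphas
)
print(best_kappa, best_alpha, beta_unit_mean(best_alpha, best_kappa))
```

A grid search scales poorly and only returns values on the grid; in practice a constrained optimizer would be preferable, for instance an SQP-type routine such as the `SLSQP` method of SciPy's `scipy.optimize.minimize`, if that library is available.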



3 Empirical application to income data 



The $\kappa$-generalized distribution was fit to data on personal income distribution
for Australia and the United States.² The data are derived from panel surveys
conducted in 2002-03 and 2003, respectively. The unit of assessment is
the household, and income is expressed in nominal local currency units (and
is equivalized for differences in household size by adjusting by the square root
of the number of household members [18]). There are 10,211 households in the
2002-03 Australian survey, and 7,822 in the 2003 US survey. All calculations
use the sampling weights produced by the data provider [19]. We consider
the distributions of disposable income, i.e. the income recorded after the
payment of taxes and government transfers. In the data analysis, we exclude the
observations with zero and negative values, and normalize income to its
empirical average, given by 32,891.17 ± 343.58 AUD and 31,812.39 ± 598.74 USD,
respectively.³



Maximum likelihood estimates are shown in panels (a) and (b) of Figs. 1
and 2. All the parameters were very precisely estimated, and the comparison
between the fitted and sample estimates of the CCDF and PDF suggests that
the $\kappa$-generalized distribution offers great potential for describing the data
over their whole range, from the low to medium income region through to the
high income Pareto power-law regime, including the intermediate region for
which a clear deviation exists when two different curves are used.



2 These data were not studied in the previous paper [11], where the emphasis was
on other countries. However, our main findings here have been applied also to the
data included in that work, and the results are available from the authors upon
request.

3 More detailed information on the Australian data is available on the
Australian Bureau of Statistics (ABS) web site: http://www.abs.gov.au.
For analyses referring to the same country and data source, see Refs.
[20]. For the US data, see Refs. [21], or consult the following web address:
http://www.human.cornell.edu/che/PAM/Research/Centers-Programs/German-Panel/cnef.cfm

Fig. 1. The Australian personal income distribution in 2002-03 measured in current
year AUD. (a) Plot of the empirical CCDF in the log-log scale. The solid line is our
theoretical model given by Eq. (3), which fits the data very well over the whole
range from the low to the high incomes, including the intermediate income region.
This function is compared with the ordinary stretched exponential one (dotted line),
fitting the low income data, and with the pure power-law (dashed line), fitting the
high income data. (b) Histogram plot of the empirical PDF with superimposed fits of
the $\kappa$-generalized (solid line) and Weibull (dotted line) PDFs. (c) Plot of the Lorenz
curve. The hollow circles represent the empirical data points and the solid line is
the theoretical curve given by Eq. (11) using the same parameter values as in panels
(a) and (b). The dashed line corresponds to the Lorenz curve of a society in which
everybody receives the same income and thus serves as a benchmark case against
which actual income distribution may be measured. (d) Q-Q plot of the sample
quantiles versus the corresponding quantiles of the fitted $\kappa$-generalized distribution.
The reference line has been obtained by locating points on the plot corresponding to
around the 25th and 75th percentiles and connecting these two. In plots (a), (b) and
(d) the income axis limits have been adjusted according to the range of data to shed
light on the intermediate region between the bulk and the tail of the distribution.

Fig. 2. Same plots as in Fig. 1 for the US personal income distribution in 2003. The
income variable is measured in current year USD.

Panel (c) of the same figures depicts the data points for the empirical Lorenz
curve, i.e. $L(i/n) = \sum_{j=1}^{i} x_j / \sum_{j=1}^{n} x_j$, $i = 1, 2, \ldots, n$, superimposed on the
theoretical curve $L_\kappa(u)$ given by Eq. (11) with estimates replacing $\alpha$ and
$\kappa$ as necessary. This formula is shown by the solid line in the plots, and
fits the data exceptionally well. The plots also exhibit a very good agreement
between the empirical estimates of the Gini coefficient, obtained as
$G = \frac{1}{n^2 m}\sum_{i=1}^{n}(2i - n - 1)\,x_i$ with the $x_i$ sorted in ascending order, and the
values returned by the analytical expression given by Eq. (12) for the estimated
$\kappa$-generalized distribution; the 95% confidence intervals constructed around the
values of $G$ always cover the theoretical predictions $G_\kappa$.⁶,⁷
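
The empirical counterparts of the Lorenz curve and the Gini coefficient can be computed directly from a sorted sample. The Python sketch below is a simplified illustration: it ignores sampling weights, assumes the usual $1/(n^2 m)$ normalization of the Gini sum with $m$ the sample mean, and uses hypothetical function names and a toy data set.

```python
def empirical_lorenz(incomes):
    """Points (i/n, L(i/n)) with L(i/n) = sum_{j<=i} x_j / sum_{j<=n} x_j."""
    xs = sorted(incomes)
    n, total = len(xs), sum(xs)
    points, running = [], 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points

def empirical_gini(incomes):
    """G = sum_i (2i - n - 1) x_i / (n^2 m), with x sorted in ascending order."""
    xs = sorted(incomes)
    n = len(xs)
    m = sum(xs) / n
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * n * m)

toy = [3.0, 1.0, 4.0, 2.0]            # toy incomes, unweighted (assumption)
print(empirical_lorenz(toy))
print(empirical_gini(toy))
```

For the survey data described in the text, the sums would additionally have to incorporate the sampling weights, which this sketch leaves out.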

6 For the formulas used to estimate the empirical Lorenz curve and Gini coefficient
see e.g. Refs. [22] and [23], respectively.

7 The confidence intervals for the observed Gini coefficients have been calculated
via the bootstrap resampling method based on 1000 replications. For general details
about bootstrapping, see Refs. [24].

In order to further evaluate the accuracy of our distributional model, we
have also tested the hypothesis that each set of $n$ observed data follows
a $\kappa$-generalized distribution by calculating the Kolmogorov-Smirnov (K-S)
goodness-of-fit test statistic given by $D^+ = \max_{1 \le i \le n}\left[i\,n^{-1} - P_<(x_i)\right]$.
Since in this case there is no asymptotic formula for calculating the
p-value, we have reduced the problem to testing that the $x_\kappa$ values have a
standard exponential distribution (i.e., an exponential distribution with parameter
equal to 1) by relating the function $P_>(x)$ given by Eq. (3) to the ordinary
exponential function, namely $\exp_\kappa(-\beta x^\alpha) = \exp(-x_\kappa)$, through the
transformation $x_\kappa = \frac{1}{\kappa}\log\left(\sqrt{1+\beta^2\kappa^2x^{2\alpha}} + \beta\kappa x^\alpha\right)$, where the parameters are
estimated from the data. Thus the significance level in the upper tail is given
approximately by $P_>(T^*) = \exp\left[-2(T^*)^2\right]$, with $T^* = D^+\left(\sqrt{n} + 0.12 + 0.11/\sqrt{n}\right)$,
as suggested for example by Ref. [25]. The results are shown in the upper-left
corner of panel (d) of Figs. 1 and 2. As one can appreciate, the maximum
distance between the empirical data and the theoretical model as assessed by
the K-S statistic is very small, and the p-values in parentheses do not lead to
rejection of the null hypothesis that the data may come from a $\kappa$-generalized
distribution at any of the usual significance levels (1%, 5% and 10%). The linear
behaviour emerging from the Quantile-Quantile (Q-Q) plots of the sample
quantiles versus the corresponding quantiles of the fitted $\kappa$-generalized
distribution displayed in the same panel strongly supports the quantitative results
obtained by hypothesis testing.
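
The testing procedure described above can be sketched in a few lines of Python. The example is purely illustrative: the function names are hypothetical, the "data" are exact quantiles of the model placed at the mid-ranks $(i - 0.5)/n$ (so that $D^+ = 0.5/n$ by construction), and the parameter values are not estimates from the paper.

```python
import math

def to_exponential(x, alpha, beta, kappa):
    """x_kappa = (1/kappa) * log(sqrt(1 + (beta kappa x^alpha)^2) + beta kappa x^alpha)."""
    y = beta * kappa * x ** alpha
    return math.log(math.sqrt(1 + y * y) + y) / kappa

def ks_upper_tail(data, alpha, beta, kappa):
    """Return (D+, approximate upper-tail p-value) for the fitted model."""
    xs = sorted(data)
    n = len(xs)
    # After the transformation the null CDF is that of a standard exponential
    d_plus = max(
        i / n - (1 - math.exp(-to_exponential(x, alpha, beta, kappa)))
        for i, x in enumerate(xs, start=1)
    )
    t_star = d_plus * (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n))
    return d_plus, math.exp(-2 * t_star ** 2)

def quantile(u, alpha, beta, kappa):
    """Quantile function of Eq. (5), used here only to build synthetic data."""
    v = 1 / (1 - u)
    return ((v ** kappa - v ** (-kappa)) / (2 * kappa * beta)) ** (1 / alpha)

alpha, beta, kappa = 2.0, 1.0, 0.5            # illustrative values only
n = 100
data = [quantile((i - 0.5) / n, alpha, beta, kappa) for i in range(1, n + 1)]
d_plus, p_value = ks_upper_tail(data, alpha, beta, kappa)
print(d_plus, p_value)                        # D+ equals 0.5 / n up to rounding
```

Because the synthetic sample sits exactly on the model quantiles, the statistic is as small as it can be and the approximate p-value is close to one; real data would of course give larger distances.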



4 Summary and conclusions 



One of the main objectives of research on income distribution is to provide a
mathematical description of the size distribution of income for approximating
the underlying "true" distribution. Starting from Pareto's contribution, a wide
variety of functional forms have been considered as possible models for the
distribution of personal income by size, and other approaches can no doubt
be suggested and deserve attention.

In this work we have proposed a three-parameter distribution by using a new
approach rooted in the framework of $\kappa$-generalized statistical mechanics.
This model is able to describe the entire income range, including
the Pareto upper tail, and fits the Australian and US income data extremely
well. The analysis of inequality performed in terms of its parameters reveals
the merit of the proposed distribution, and provides the basis for a fruitful
interaction between the two fields of statistical mechanics and economics.